A novel microbe-drug association prediction model based on graph attention networks and bilayer random forest

Background In recent years, the extensive use of drugs and antibiotics has led to increasing microbial resistance, so it has become crucial to explore the deep connections between drugs and microbes. However, traditional biological experiments are expensive and time-consuming. It is therefore meaningful to develop efficient computational models to forecast potential microbe-drug associations.
Results In this manuscript, we propose a novel prediction model called GARFMDA, which combines graph attention networks and a bilayer random forest to infer probable microbe-drug associations. In GARFMDA, we first construct two different microbe-drug networks by integrating different microbe-drug-disease correlation indices. Then, based on multiple measures of similarity, we construct a unique feature matrix for drugs and microbes respectively. Next, we feed the newly obtained microbe-drug networks together with the feature matrices into the graph attention network to extract low-dimensional feature representations for drugs and microbes separately. Thereafter, these low-dimensional feature representations, along with the feature matrices, are input into the first layer of the bilayer random forest to obtain the contribution values of all features. Finally, after features with low contribution values are removed, the remaining features are fed into the second layer of the bilayer random forest to detect potential links between microbes and drugs.
Conclusions Experimental results and case studies show that GARFMDA achieves better prediction performance than state-of-the-art approaches, which suggests that GARFMDA may be a useful tool for microbe-drug association prediction. The source code of GARFMDA is available at https://github.com/KuangHaiYue/GARFMDA.git
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-024-05687-9.


Background
A multitude of microbial communities, including bacteria, fungi, viruses, and other microbes, have been found in the human body. They are intimately linked to human health and are crucial to numerous physiological processes, including immune regulation, vitamin production, and the maintenance of digestive function [1,2]. However, some microorganisms may be associated with the development of disease under specific circumstances. For instance, an imbalance of human gut bacteria can increase the risk of high blood pressure [3].
In recent years, the misuse and irrational use of antibiotics, mutation and horizontal transfer of microbial genes, and the spread of microorganisms in medical and social environments have led to microbial resistance to antibiotics, which renders previously effective antibiotic treatments ineffective and poses a serious challenge to clinical treatment [4]. Therefore, it is meaningful to develop efficient computational models to detect microbial resistance and find new antibiotics, because such models can infer latent microbe-drug associations and thus provide a simple and efficient way to address microbial resistance.
Over the last few years, a number of databases of microbe-drug associations, including MDAD [5], aBiofilm [6], and Drugvirus [7], have been adopted by researchers to construct a variety of computational models for identifying possible microbe-drug associations. For example, in 2019, Zhu et al. [8] created a prediction model named HMDAKATZ based on the KATZ measure. In 2021, Deng et al. [9] devised a method called Graph2MDA, which uses multimodal attribute graphs as inputs of variational graph autoencoders to learn information about every node and the complete graph. Long et al. [10] introduced the metapath2vec scheme for learning low-dimensional embedded representations of microorganisms and drugs, designed a partial dichotomous network projection recommendation algorithm, and proposed a novel calculation method named HNERMDA. In 2023, Ma et al. [11] combined graph attention networks and CNN-based classifiers to construct a model called GACNNMDA. Huang et al. [12] designed a model named GNAEMDA based on graph normalized convolutional networks. Cheng et al. [13] designed a model called NIRBMMDA based on neighbourhood-based inference and the restricted Boltzmann machine. Li et al. [14] combined matrix decomposition and a three-layer heterogeneous network to create a model called MFTLHNMDA to infer microbe-drug associations.
In this article, in order to improve the performance of prediction models, we designed a new prediction model named GARFMDA by combining a graph attention network (GAT) and a two-layer random forest (RF). In GARFMDA, a two-layer GAT was first adopted to learn the low-dimensional feature representations of microbes and drugs. Then, a two-layer random forest model was introduced to obtain the contribution values of all features and, after eliminating the low-contribution features, to predict possible associations between microorganisms and drugs. Additionally, we conducted extensive case studies and comparison experiments to assess the prediction performance of GARFMDA. As a result, GARFMDA achieved satisfactory results in microbe-drug association prediction and outperformed existing representative competing methods.

Data sources
In this section, we will first download known microbe-drug associations from the MDAD database (https://figshare.com/search?q=10.6084%2Fm9.figshare.24798456), which consists of 2470 validated microbe-drug associations, including 1373 drugs and 173 microbes. Subsequently, we will download additional data on microbe, drug and disease associations from the database proposed by Wang et al. [14], which contains 70,315 reported drug-disease connections and 15,633 reported microbe-disease connections. Following a rigorous screening procedure to eliminate disease-related correlations whose drugs or microorganisms have no known association in the MDAD database, we finally obtain 109 unique drug-disease connections covering 1,121 drugs and 233 diseases, and 109 unique microbe-disease connections covering 402 microbes and 73 diseases from the database proposed by Wang et al. Furthermore, from the data collection created by Deng et al. [9], we have also gathered 138 known microbe-microbe interactions, encompassing 123 microbes in MDAD, and 5586 known drug-drug relationships, covering 1228 drugs in MDAD. Additional files 1, 2, 3, 4, 5, 6, 7, 8 and Table 1 below provide details of these datasets.

Methods
As shown in Fig. 1, GARFMDA is composed of the following three main parts:
Part 1: Firstly, based on the newly-downloaded datasets on microbes, drugs and diseases, two different heterogeneous microbe-drug networks $HN_1$ and $HN_2$ will be constructed.
Part 2: Then, based on multiple similarity metrics of microbes and drugs, a feature matrix will be created for microbes and drugs separately, which will then be fed into the GAT along with $HN_1$ and $HN_2$ to learn the low-dimensional feature representations for microbes and drugs respectively.
Part 3: Finally, these two newly-obtained low-dimensional feature representations, along with the two feature matrices, will be input into a two-layer random forest model to compute the probability scores of drug-microbe relationships.

Construction of two heterogeneous microbe-drug networks
For any given database D, let $n_r$ and $n_m$ denote the numbers of drugs and microorganisms newly downloaded from D, respectively. We can then construct an adjacency matrix $D^1 \in \mathbb{R}^{n_r \times n_m}$ between drugs and microbes as follows: for any given drug $r_i$ and microbe $m_j$, if there is a known relationship between them in D, then $D^1(i,j) = 1$; otherwise $D^1(i,j) = 0$. Similarly, based on the newly-downloaded datasets of known connections between microbes and drugs, microbes and diseases, and drugs and diseases, we create another microbe-drug adjacency matrix $D^2 \in \mathbb{R}^{n_r \times n_m}$ as follows: for a given drug $r_i$ and microbe $m_j$, if there exists a disease $d_k$ with a known relationship to $m_j$ and a known association with $r_i$, then $D^2(i,j) = 1$; otherwise $D^2(i,j) = 0$.
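The construction of the two adjacency matrices can be illustrated with a minimal sketch. The function names, the toy data and the use of NumPy are assumptions made for illustration only and are not taken from the GARFMDA source code.

```python
import numpy as np

def build_d1(n_drugs, n_microbes, known_pairs):
    """D1[i, j] = 1 if drug i and microbe j have a known association."""
    d1 = np.zeros((n_drugs, n_microbes), dtype=int)
    for i, j in known_pairs:
        d1[i, j] = 1
    return d1

def build_d2(drug_disease, microbe_disease):
    """D2[i, j] = 1 if drug i and microbe j are linked to a common disease.

    drug_disease:    (n_drugs, n_diseases) 0/1 matrix
    microbe_disease: (n_microbes, n_diseases) 0/1 matrix
    """
    shared = drug_disease @ microbe_disease.T  # number of shared diseases
    return (shared > 0).astype(int)

# Hypothetical toy data: 3 drugs, 2 microbes, 2 diseases.
d1 = build_d1(3, 2, [(0, 0), (2, 1)])
drug_disease = np.array([[1, 0], [0, 0], [0, 1]])
microbe_disease = np.array([[1, 1], [0, 1]])
d2 = build_d2(drug_disease, microbe_disease)
```

In this toy case the second drug has no disease annotation, so its row in `d2` stays empty, which shows that $D^2$ can only add links for drugs and microbes that appear in the disease data.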
Hence, based on the above two adjacency matrices $D^1$ and $D^2$, it is simple to build two heterogeneous microbe-drug networks $HN_1$ and $HN_2$ in the following way. Firstly, in $D^v$ ($v = 1, 2$), let $D^v(r_i)$ and $D^v(m_j)$ denote the i-th row and the j-th column of $D^v$ respectively. Then, for any two given drugs $r_i$ and $r_j$, we calculate the Gaussian Interaction Profile (GIP) kernel similarity between them as follows:

$$A^v_{rg}(r_i, r_j) = \exp\left(-\gamma_r \left\| D^v(r_i) - D^v(r_j) \right\|^2\right) \tag{1}$$

$$\gamma_r = \frac{1}{\frac{1}{n_r} \sum_{k=1}^{n_r} \left\| D^v(r_k) \right\|^2} \tag{2}$$

where $\|\cdot\|$ denotes the Frobenius norm. Based on Eq. (1), we obtain a GIP kernel similarity matrix $A^v_{rg} \in \mathbb{R}^{n_r \times n_r}$ for drugs. In a similar way, for any two given microbes $m_i$ and $m_j$, we calculate the GIP kernel similarity between them as follows:

$$A^v_{mg}(m_i, m_j) = \exp\left(-\gamma_m \left\| D^v(m_i) - D^v(m_j) \right\|^2\right) \tag{3}$$

$$\gamma_m = \frac{1}{\frac{1}{n_m} \sum_{k=1}^{n_m} \left\| D^v(m_k) \right\|^2} \tag{4}$$

Based on Eq. (3), we obtain a GIP kernel similarity matrix $A^v_{mg} \in \mathbb{R}^{n_m \times n_m}$ for microbes as well.
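A minimal sketch of the GIP kernel computation follows. It assumes the usual bandwidth normalisation (the reciprocal of the mean squared norm of the interaction profiles); the function name `gip_kernel` and the toy matrix are illustrative and not taken from the GARFMDA code.

```python
import numpy as np

def gip_kernel(profiles):
    """GIP kernel similarity between the rows of a 0/1 profile matrix."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    mean_sq = sq_norms.mean()
    if mean_sq == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = 1.0 / mean_sq  # assumed bandwidth normalisation
    # Squared Euclidean distance between every pair of rows.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Toy adjacency matrix: rows are drugs, columns are microbes.
d = np.array([[1, 0, 1], [1, 0, 1], [0, 1, 0]])
a_rg = gip_kernel(d)    # drug-drug similarity from the rows
a_mg = gip_kernel(d.T)  # microbe-microbe similarity from the columns
```

Drugs with identical interaction profiles receive similarity 1, and the similarity decays exponentially as the profiles diverge.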
Next, based on the assumption that two nodes with highly dissimilar interaction profiles are less similar to each other [15], for any two given drugs $r_i$ and $r_j$, we calculate the Hamming Interaction Profile (HIP) similarity between them as follows:

$$A^v_{rh}(r_i, r_j) = 1 - \frac{\left| D^v(r_i) \neq D^v(r_j) \right|}{\left| D^v(r_i) \right|} \tag{5}$$

where $|D^v(r_i) \neq D^v(r_j)|$ is the number of positions at which the two profiles differ and $|D^v(r_i)|$ is the length of the profile. Similarly, for any two given microbes $m_i$ and $m_j$, the HIP similarity between them is determined as follows:

$$A^v_{mh}(m_i, m_j) = 1 - \frac{\left| D^v(m_i) \neq D^v(m_j) \right|}{\left| D^v(m_i) \right|} \tag{6}$$

Hence, based on Eqs. (5) and (6), we obtain two HIP similarity matrices $A^v_{rh} \in \mathbb{R}^{n_r \times n_r}$ and $A^v_{mh} \in \mathbb{R}^{n_m \times n_m}$ for drugs and microbes separately. Finally, for any two given drugs $r_i$ and $r_j$, we construct an integrated similarity $A^v_r(r_i, r_j)$ by integrating $A^v_{rg}$ and $A^v_{rh}$ (Eq. 7). Similarly, for any two given microbes $m_i$ and $m_j$, we construct an integrated similarity $A^v_m(m_i, m_j)$ by integrating $A^v_{mg}$ and $A^v_{mh}$ (Eq. 8). Based on Eqs. (7) and (8), we finally obtain two new matrices $H^1 \in \mathbb{R}^{(n_r+n_m) \times (n_r+n_m)}$ and $H^2 \in \mathbb{R}^{(n_r+n_m) \times (n_r+n_m)}$ as follows:

$$H^v = \begin{bmatrix} A^v_r & D^v \\ (D^v)^T & A^v_m \end{bmatrix}, \quad v = 1, 2 \tag{9, 10}$$

Based on the two matrices $H^1$ and $H^2$, the two heterogeneous microbe-drug networks $HN_1$ and $HN_2$ can be constructed respectively.
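The HIP similarity and the assembly of the heterogeneous matrix can be sketched as below. This is an illustration under stated assumptions: HIP is taken to be one minus the fraction of positions at which two binary profiles differ, and the HIP matrices stand in for the integrated similarities $A^v_r$ and $A^v_m$ purely to keep the example short. All function names are hypothetical.

```python
import numpy as np

def hip_similarity(profiles):
    """1 minus the fraction of positions where two binary profiles differ."""
    profiles = np.asarray(profiles)
    differs = profiles[:, None, :] != profiles[None, :, :]
    return 1.0 - differs.mean(axis=2)

def heterogeneous_matrix(a_r, a_m, d):
    """Stack drug similarity, microbe similarity and associations into H."""
    return np.block([[a_r, d], [d.T, a_m]])

# Toy adjacency matrix: rows are drugs, columns are microbes.
d = np.array([[1, 0, 1], [1, 1, 1], [0, 1, 0]])
a_rh = hip_similarity(d)    # drug-drug HIP similarity
a_mh = hip_similarity(d.T)  # microbe-microbe HIP similarity
# Simplification: HIP alone is used here in place of the integrated similarity.
h = heterogeneous_matrix(a_rh, a_mh, d)
```

The resulting matrix is symmetric, with the similarity blocks on the diagonal and the association matrix and its transpose off the diagonal.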

Constructing unique feature matrix for microbes and drugs
In this section, we will first adopt SIMCOMP2 [16] to determine the structural similarity between any two given drugs $r_i$ and $r_j$, and obtain a new drug structural similarity matrix $A_{rc}$. Next, we will utilize the method presented by Kamneva [17] to determine the functional similarity between any two given microorganisms $m_i$ and $m_j$, and create a new microbe functional similarity matrix $A_{mf}$. We will then perform random walk with restart (RWR) [39] on the integrated similarity matrices $A^v_r$ and $A^v_m$ separately (Eq. 11). Recall that $A^v_r$ and $A^v_m$ (Eqs. 7 and 8) are defined piecewise, depending on whether there is a known association between $r_i$ and $r_j$ (or between $m_i$ and $m_j$), and are built from the sums $A^v_{rg}(r_i, r_j) + A^v_{rh}(r_i, r_j)$ and $A^v_{mg}(m_i, m_j) + A^v_{mh}(m_i, m_j)$. In Eq. (11), $Q$ is the matrix of transition probabilities, $q_i^l$ is the likelihood of node $i$ transferring to node $l$, and $\beta_i \in \mathbb{R}^{1 \times n}$ is the initial probability vector for node $i$, whose j-th element is defined as follows:

$$\beta_i(j) = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases} \tag{12}$$

Based on Eqs. (11) and (12), we obtain two different matrices $A^v_{rr}$ and $A^v_{mm}$ from $A^v_r$ and $A^v_m$ respectively. Thereafter, based on these newly obtained matrices, we construct a unique feature matrix $S^v$ that preserves more of the original features of microbes and drugs (Eqs. 13, 14 and 15). From these equations, it is clear that $S^v \in \mathbb{R}^{(n_r+n_m) \times k_1}$, where $k_1$ represents the number of columns in $S^v$.
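The random walk with restart step can be sketched as follows. This is a generic RWR iteration, not necessarily the exact update used in GARFMDA: the restart probability, the column normalisation of the similarity matrix and the convergence tolerance are all assumptions, and the function name is hypothetical.

```python
import numpy as np

def random_walk_with_restart(sim, restart=0.1, tol=1e-10, max_iter=1000):
    """Run RWR from every node; row i of the result is the profile of node i."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    col_sums = sim.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0  # avoid division by zero for isolated nodes
    q_mat = sim / col_sums         # column-stochastic transition matrix Q
    start = np.eye(n)              # column i is the start vector beta_i
    state = start.copy()
    for _ in range(max_iter):
        nxt = (1 - restart) * q_mat @ state + restart * start
        converged = np.abs(nxt - state).max() < tol
        state = nxt
        if converged:
            break
    return state.T

# Toy integrated similarity matrix for three drugs.
a_r = np.array([[1.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.0]])
a_rr = random_walk_with_restart(a_r)
```

Each row of `a_rr` is a probability distribution over nodes, so the walk spreads similarity to indirectly related nodes while the restart term keeps the profile anchored to the start node.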

The structure of the two-layer GAT
Encoder: To determine the degree of similarity between any given node $i$ and one of its neighboring nodes $j$ in $H^v$ ($v = 1, 2$), we will compute the similarity coefficient $e_{ij}$ between them (Eq. 16), where $S^v(i)$ denotes the i-th row of $S^v$, $\alpha$ is an operation for feature mapping, $W^v$ is a trainable weight matrix, $\phi^v_i$ is the set of nodes that are adjacent to $i$ in $H^v$, and $\mu$ is a hyper-parameter varying between 0 and 1.
Based on Eq. (16), the attention score $\rho_{ij}$ between any two given nodes $i$ and $j$ is then calculated. Using these attention scores, a new feature of node $i$, representing the weighted sum of the features of its neighboring nodes, is obtained, and the new features of all nodes together form a new feature representation matrix $M^v$, where $k_2$ represents the number of columns in $M^v$.
Decoder: The decoder adopts the same structure as the encoder and produces the reconstructed matrix $M'^v$.
Optimization: Because the reconstructed matrix differs from the raw matrix, we adopt the mean squared error (MSE) loss function, i.e. the average of the squared differences between $M'^v$ and $H^v$, where $M'^v(i)$ and $H^v(i)$ denote the i-th rows of $M'^v$ and $H^v$ respectively.
Finally, the Adam optimizer [40] is used to minimise the loss function during model training.
Fig. 2 presents the workflow of the two-layer GAT for a better understanding of its implementation.
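The encoder, decoder and loss described above can be illustrated with a forward-pass sketch of a standard single-head graph attention layer. This is not the authors' implementation: the LeakyReLU scoring with slope 0.2 (one plausible reading of the hyper-parameter $\mu$), the added self-loops, the random weights and all names are assumptions, and training with Adam is omitted.

```python
import numpy as np

def leaky_relu(x, slope):
    return np.where(x > 0, x, slope * x)

def gat_layer(features, adjacency, weight, attn, slope=0.2):
    """One single-head graph attention layer (forward pass only)."""
    z = features @ weight                  # projected features, shape (n, k_out)
    k_out = z.shape[1]
    # e[i, j] = LeakyReLU(attn . [z_i || z_j]), split into two dot products.
    src = z @ attn[:k_out]
    dst = z @ attn[k_out:]
    e = leaky_relu(src[:, None] + dst[None, :], slope)
    # Attend only to neighbours; self-loops are an assumption of this sketch.
    mask = (adjacency > 0) | np.eye(len(z), dtype=bool)
    e = np.where(mask, e, -np.inf)
    e = e - e.max(axis=1, keepdims=True)   # numerical stability
    rho = np.exp(e)
    rho = rho / rho.sum(axis=1, keepdims=True)  # attention scores, rows sum to 1
    return rho @ z, rho

def mse_loss(reconstructed, target):
    """Mean of the squared element-wise differences."""
    return float(np.mean((reconstructed - target) ** 2))

rng = np.random.default_rng(0)
h = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # toy network
s = rng.normal(size=(3, 4))                                   # toy features
w_enc, a_enc = rng.normal(size=(4, 2)), rng.normal(size=4)
w_dec, a_dec = rng.normal(size=(2, 3)), rng.normal(size=6)
m, rho = gat_layer(s, h, w_enc, a_enc)    # encoder: low-dimensional features
m_rec, _ = gat_layer(m, h, w_dec, a_dec)  # decoder: reconstruction
loss = mse_loss(m_rec, h)
```

In a trainable version the weights would be updated by an optimizer so that the reconstruction approaches the network matrix, and the encoder output `m` would serve as the low-dimensional representation.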

The structure of the two-layer random forest
Traditional machine learning, when faced with complex nonlinear patterns, may suffer from drawbacks such as overfitting and the inability to provide uncertainty estimates of the predicted outcomes [18]. In order to calculate the potential scores of unknown drug-microbe relationships, we will create a two-layer random forest model in this section and treat drug-microbe association prediction as a binary classification problem. Selecting features in the first layer of the random forest can improve the model's performance and reduce the risk of overfitting. For the input of the first layer of the two-layer random forest, we will construct two feature matrices $B^v_r$ and $B^v_m$ from the low-dimensional feature representations and the feature matrices obtained above. Then, for any given drug $r_i$ and microbe $m_j$, let $B^v_r(i)$ and $B^v_m(j)$ represent the i-th row of $B^v_r$ and the j-th column of $B^v_m$ respectively, from which the feature matrix $F^v$ is formed, where $k_3$ represents the number of columns in $F^v$. We will then feed $F^v$ into the first layer of the bilayer random forest. In the first layer of the bilayer random forest, we assume that the number of decision trees is $p$ and the maximum depth is $s$. After training, we compare the contribution made by each feature during the growth of each decision tree by calculating the sum $G(tr)$ of the Gini index [19] changes of the feature over all decision trees in the forest, which is then used to represent the contribution $C(tr)$ made by the feature. Here, $tr$ denotes the feature index, $h$ represents the decision tree index, $m$ is the total number of features, and $Gini^{F^v}_h(tr)$ denotes the Gini index on decision tree $h$ conditional on feature $tr$.

Fig. 2 Workflow of the two-layer GAT in GARFMDA
After that, we will eliminate the features whose contribution values are less than a threshold $L$ and obtain a new feature matrix $F'^v$, which will be fed into the second layer of the bilayer random forest for training and prediction. In this way, we finally obtain a score matrix.
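The feature-elimination step between the two layers can be sketched as follows. The sketch assumes that each feature's contribution is its summed Gini decrease over all trees, normalised by the total over all features; the input table of per-tree Gini decreases, the toy threshold and the function names are hypothetical, and the random forests themselves are not trained here.

```python
def contributions(gini_decrease):
    """gini_decrease[h][tr]: total Gini decrease of feature tr in tree h."""
    n_features = len(gini_decrease[0])
    totals = [sum(tree[tr] for tree in gini_decrease) for tr in range(n_features)]
    grand_total = sum(totals)
    # Assumed normalisation: contributions sum to 1 across features.
    return [t / grand_total for t in totals]

def select_features(rows, contrib, threshold):
    """Drop every feature whose contribution is below the threshold."""
    keep = [tr for tr, c in enumerate(contrib) if c >= threshold]
    return [[row[tr] for tr in keep] for row in rows], keep

# Hypothetical Gini decreases for 3 features in 2 trees.
gini_decrease = [[0.30, 0.00, 0.10], [0.50, 0.02, 0.08]]
contrib = contributions(gini_decrease)
samples = [[1, 2, 3], [4, 5, 6]]
filtered, kept = select_features(samples, contrib, threshold=0.05)
```

Here the second feature contributes only about 2% of the total Gini decrease and is removed, so the second layer would be trained on the two remaining columns.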
Based on the matrices $H^1$ and $H^2$, we obtain two different score matrices $Score^1$ and $Score^2$ respectively, which are then combined into an integrated score matrix $S \in \mathbb{R}^{n_r \times n_m}$ (Eq. 27).

Results
In this section, we will first examine the impact of parameters on the prediction performance of GARFMDA. Then, we will compare GARFMDA with five cutting-edge competing prediction methods. Finally, in order to illustrate the effectiveness of GARFMDA, we will present case studies on some well-known drugs and microbes.

Sensitivity analysis of hyperparameters
From the above descriptions, it is clear that there are some important parameters in GARFMDA, including the GAT learning rate, the GAT dropout rate, the maximum depth of the decision trees in the bilayer random forest, and the contribution-value threshold used to select features. In this section, we will run fivefold cross validation (CV) 10 times on MDAD to assess the impact of these parameters on the effectiveness of GARFMDA and determine their best values.
For simplicity, in experiments, we will use the abbreviations lr, dp, s and l to stand for the learning rate of GAT, the dropout rate of GAT, the maximum depth of the decision trees in the first and second layers of the bilayer random forest, and the contribution-value threshold for the chosen features, respectively. Firstly, we will evaluate the impact of lr on the prediction performance of GARFMDA while it varies in the range of {0.0001, 0.001, 0.01, 0.05, 0.1}. As shown in Fig. 3a, GARFMDA achieves the highest AUC when lr is set to 0.01. Next, we will limit the value of dp to the range of {0.2, 0.4, 0.5, 0.7}, and as shown in Fig. 3b, GARFMDA achieves the highest AUC when dp is set to 0.4. Additionally, we will restrict the value of s to the range of {1, 3, 5, 7, 9}, and as illustrated in Fig. 3c, GARFMDA achieves the highest AUC when s is set to 7. Finally, we will limit the value of l to the range of {0.0001, 0.0005, 0.001, 0.0012, 0.0015}, and as shown in Fig. 3d, the performance of GARFMDA is best when l is set to 0.0012. As for the parameter pf, the number of decision trees in the bilayer random forest, we found through comparative experiments that its effect on the prediction performance of GARFMDA is not significant, but the computational efficiency of GARFMDA decreases when pf is large. Therefore, we will set the number of decision trees in both layers of the bilayer random forest to 250 during experiments. Similarly, we found through experiments that the number of training rounds of GAT has little effect on the prediction performance of GARFMDA, so we will set it to 10.
Accordingly, we will use the best-performing parameter values to evaluate GARFMDA, i.e., we will set lr to 0.01, dp to 0.4, s to 7 and l to 0.0012 in subsequent comparison experiments.
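The repeated cross-validation protocol used for the sensitivity analysis can be sketched with the standard library alone. The splitting strategy (a fresh shuffle per repeat), the seed and the helper names are assumptions; `evaluate` stands for a hypothetical function that trains the model with one setting and returns its mean AUC.

```python
import random

def repeated_kfold(n_samples, k=5, repeats=10, seed=0):
    """Yield (train_idx, test_idx) for `repeats` rounds of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[f::k] for f in range(k)]
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test

def best_setting(grid, evaluate):
    """Return the grid value with the highest score from `evaluate`."""
    return max(grid, key=evaluate)

# 10 repeats of fivefold CV over 23 toy samples gives 50 train/test splits.
splits = list(repeated_kfold(n_samples=23, k=5, repeats=10))
```

Each candidate value of lr, dp, s or l would be scored by averaging the AUC over all 50 splits, and `best_setting` would then pick the value with the highest average.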

Comparison with state-of-the-art methods
To validate the predictive performance of GARFMDA, we will compare it with the following five representative approaches:
(1) LAGCN [20]: a computational model for inferring unknown drug-disease associations based on graph convolutional networks and attention mechanisms.
(2) GSAMDA [21]: a microbe-drug association prediction model based on graph attention networks and sparse autoencoders.
(3) SCSMDA [22]: a model that predicts microbe-drug associations based on structure-enhanced contrastive learning and self-paced negative sampling strategies.
(4) MDASAE [23]: a calculation method that fuses multi-attention mechanisms with stacked autoencoders to detect possible microbe-drug associations.
(5) LRLSHMDA [24]: a computational scheme that exploits Laplacian Regularised Least Squares to predict microbe-disease associations.
During experiments, we will adopt the AUC, Accuracy and F1-score values as performance indicators and compare all of these rival approaches under the framework of tenfold cross validation. Experimental results are shown in Table 2 and Fig. 4 respectively. From Table 2, it can be seen that GARFMDA reaches the highest AUC value of 0.9794 ± 0.0012, while MDASAE comes second with an AUC value of 0.9701 ± 0.0023, and LAGCN has the lowest AUC value of 0.8544 ± 0.0042. As for the Accuracy and F1-score values, GARFMDA also obtains the highest values of 0.9955 and 0.7106 respectively. Therefore, GARFMDA achieves the best prediction performance when compared with all five competing models.
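The three performance indicators can be computed as in the sketch below, which uses only the standard library. The pairwise AUC formula, the decision threshold of 0.5 and the toy labels and scores are assumptions for illustration and are not the evaluation code of GARFMDA.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy_and_f1(labels, scores, threshold=0.5):
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return acc, f1

# Toy imbalanced sample: 2 known associations among 10 drug-microbe pairs.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.6, 0.2, 0.4]
auc_value = auc(labels, scores)
acc_value, f1_value = accuracy_and_f1(labels, scores)
```

On imbalanced data the three indicators can diverge considerably. In MDAD, the 2470 known associations cover only about 1% of the 1373 × 173 = 237,529 possible drug-microbe pairs, so if most unlabeled pairs are treated as negatives, a high Accuracy alongside a much lower F1-score, as reported in Table 2, is not unexpected.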

Case study
In this section, we will undertake case studies of two well-known medications and one well-known microbe to better illustrate the efficacy of GARFMDA. In experiments, we will choose the top 20 candidate microbes or drugs predicted by GARFMDA and search PubMed (https://pubmed.ncbi.nlm.nih.gov) to see whether any publications have reported them. The first drug we have chosen is ciprofloxacin, a synthetic second-generation quinolone antimicrobial drug with broad-spectrum antimicrobial activity and bactericidal efficacy, which can be used to treat illnesses caused by mycobacterium influenzae, escherichia coli, and pneumococcus specific polysaccharide [25]. In both in vitro and in vivo studies of ciprofloxacin, a very low incidence of resistant microorganisms has been reported [26].
In addition, Alhajj et al. [27] developed a dry powder of ciprofloxacin for inhalation for treating cystic fibrosis lung infections. Golapudi et al. [28] demonstrated that ciprofloxacin inhibits TNF-α-induced HIV secretion in U1 cells. Table 3 illustrates that 19 of the top 20 predicted candidate bacteria have been confirmed by published literature to be related to ciprofloxacin.
The second drug we have selected is moxifloxacin, a quinolone broad-spectrum antimicrobial that treats adults (≥ 18 years of age) suffering from respiratory tract infections, both upper and lower [29], as well as acute sinusitis [30], acute exacerbations of chronic bronchitis [31], community-acquired pneumonia [32], and skin and soft tissue infections [33]. Januel et al. [34] studied the use of moxifloxacin to treat the genetic disorder spinal muscular atrophy (SMA). However, Inada et al. [35] found that moxifloxacin can induce aortic aneurysm and dissection by increasing osteopontin in mice.
Table 4 shows that 15 of the top 20 predicted candidate microorganisms have been confirmed by published literature to be associated with moxifloxacin, demonstrating the value of GARFMDA for clinical drug application and the identification of possible drug-related bacteria.
The microorganism that we have selected is E. coli, a conditionally pathogenic bacterium that under certain conditions can cause gastrointestinal infections or a variety of localised tissue and organ infections, such as urogenital infections, in humans and a wide range of animals [36]. Pathogenic E. coli can cause more than 16.01 billion cases of dysentery [37] and 1 million deaths annually, whereas non-pathogenic E. coli are part of the normal gut flora of healthy mammals and birds. For example, it is anticipated that the E. coli strain Nissle will be used to treat human illnesses in addition to being used as a probiotic and therapeutic agent [38]. As shown in Table 5, 15 of the top 20 predicted drugs have been confirmed by published literature to be associated with E. coli.
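The candidate selection used in the case studies amounts to ranking one row or column of the integrated score matrix. A minimal sketch, with a hypothetical function name and made-up scores and names, is given below. Whether already known associations are excluded before ranking is not stated in the text, so no such filter is applied here.

```python
def top_candidates(scores, names, k=20):
    """Return the k names with the highest predicted scores."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Made-up scores of three candidate microbes for one drug.
example = top_candidates([0.2, 0.9, 0.5], ["microbe_a", "microbe_b", "microbe_c"], k=2)
```

Each returned candidate would then be checked manually against the literature, as done for Tables 3, 4 and 5.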

Conclusion and discussion
In this paper, we developed a new prediction model called GARFMDA by combining a two-layer GAT with a two-layer random forest to detect possible drug-microbe associations. Results of both comparison experiments and case studies showed that GARFMDA outperformed state-of-the-art competing prediction models. GARFMDA can also be adopted to solve other problems involving the association prediction of biological entities, such as the prediction of associations between diseases and circRNAs or microbes. Of course, GARFMDA can still be improved. For instance, we can add more biological data, such as microbial sequencing information, to the feature selection stage [9]. Additionally, because the dataset is sparse, the model is prone to overfitting. To address this issue, we can also consider data augmentation. Moreover, the public database is not updated in real time, which may affect how the model is used in practice. Therefore, we may consider constructing a more extensive database in the future.

Fig. 3
Effects of parameters on the performance of GARFMDA. a and b show the AUC values achieved by GARFMDA with different learning rates and dropout rates of GAT, respectively. c and d illustrate the AUC values achieved by GARFMDA under different maximum depths of decision trees and contribution values of selected features in the bilayer random forest, respectively

Table 1
Specifics of the newly-downloaded dataset

Table 2
AUC values, Accuracy values and F1-score values obtained by GARFMDA and five competing methods under the framework of tenfold CV on MDAD

Table 3
The top 20 predicted candidate ciprofloxacin-associated bacteria. In this table, the first column lists the top 10 predicted microbes, while the third column lists the top 11 to 20 predicted microbes

Table 4
The top 20 predicted candidate moxifloxacin-associated bacteria. In this table, the first column lists the top 10 predicted microbes, while the third column lists the top 11 to 20 predicted microbes

Table 5
The top 20 predicted drugs linked to E. coli. In this table, the first column lists the top 10 predicted drugs, while the third column lists the top 11 to 20 predicted drugs